In this lab you’ll have the chance to go over a number of things that we covered before reading week. We will focus on data wrangling, data subsetting, data visualisation, and building some linear models.

We will be using two sets of packages: the tidyverse packages (which include dplyr, ggplot2, etc.) and the gapminder package, which includes some nice global data on population size, life expectancy, GDP, etc. for a number of countries over a number of years.

First of all, load the tidyverse packages. Remember, if you haven’t installed either before, you’ll need to first type install.packages("packagename") in the console. Obviously, replace “packagename” with the name of the package you want to download!

library(tidyverse)

First we’re going to look at joining together two datasets. This is the code from the Week 3 lecture where we had two datasets. The first (called data1.csv) contains data for 10,000 people, each with measures of Working Memory (WM), IQ, and Reading Comprehension (Comp). The second (called dataRT.csv) contains the reading time data for 48 of these 10,000 people. The reading time measures are associated with reading Complex and Simple sentences.

We begin by reading in the data…

data1 <- read_csv("data_files/data1.csv")

dataRT <- read_csv("data_files/dataRT.csv")

Let’s look at the first 6 rows of each dataframe using the head() function.

head(data1)
## # A tibble: 6 x 4
##      id    wm    iq  comp
##   <dbl> <dbl> <dbl> <dbl>
## 1     1    43    72    16
## 2     2    51   109    18
## 3     3    55   107    18
## 4     4    38   102    20
## 5     5    52   121    17
## 6     6    52    92    16
head(dataRT)
## # A tibble: 6 x 3
##      id simple_sentence complex_sentence
##   <dbl>           <dbl>            <dbl>
## 1  1138            1902             2341
## 2  6223            1797             2503
## 3  6092            2080             2731
## 4  6232            1856             2375
## 5  8606            1997             2177
## 6  6400            1868             2284

To find out how big each dataset is, we can use the str() function.

str(data1)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 10000 obs. of  4 variables:
##  $ id  : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ wm  : num  43 51 55 38 52 52 47 47 47 45 ...
##  $ iq  : num  72 109 107 102 121 92 68 97 93 101 ...
##  $ comp: num  16 18 18 20 17 16 21 23 22 17 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   wm = col_double(),
##   ..   iq = col_double(),
##   ..   comp = col_double()
##   .. )
str(dataRT)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 48 obs. of  3 variables:
##  $ id              : num  1138 6223 6092 6232 8606 ...
##  $ simple_sentence : num  1902 1797 2080 1856 1997 ...
##  $ complex_sentence: num  2341 2503 2731 2375 2177 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   simple_sentence = col_double(),
##   ..   complex_sentence = col_double()
##   .. )

You’ll see that the column id is common to the two dataframes - participant 1138 in data1 is the same participant referred to as 1138 in dataRT.

We can join the two dataframes together, matched by id, so we end up with one dataframe of 48 individuals with their individual difference measures and their reading times. We assign this to a new variable we’re calling dataRT_all.

dataRT_all <- inner_join(data1, dataRT, by = "id")
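If you’re unsure which rows inner_join() keeps, here’s a minimal sketch on two toy dataframes (the ids and values below are made up for illustration). Base R’s merge() performs the same inner join by default:

```r
# Toy illustration of an inner join: only ids present in BOTH
# dataframes survive the join
ppt_info <- data.frame(id = c(1, 2, 3), wm = c(43, 51, 55))
ppt_rt   <- data.frame(id = c(2, 3, 4), rt = c(1902, 1797, 2080))

# merge() performs an inner join by default, matching on the shared column
joined <- merge(ppt_info, ppt_rt, by = "id")
joined
#   id wm   rt
# 1  2 51 1902
# 2  3 55 1797
```

Only ids 2 and 3 appear in both dataframes, so only those two rows survive - exactly why joining the 10,000-person dataframe with the 48-person dataframe above leaves us with 48 rows.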

We’re now going to use the mutate function to create two new columns - one corresponding to the log of the Simple reading time condition, and the other to the log of the Complex reading time condition. We map this onto a new variable we’re calling data_transformed.

data_transformed <- mutate(dataRT_all, log_simple = log(simple_sentence), 
                            log_complex = log(complex_sentence))
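One thing worth knowing: in R, log() is the natural log (base e); use log10() or log2() if you want a different base. Here’s a tiny base R sketch (with toy numbers) of the transformation mutate() is performing above:

```r
# mutate() adds columns to a dataframe; in base R the equivalent is
# plain assignment. Note that log() is the NATURAL log (base e).
rts <- data.frame(simple_sentence = c(1902, 1797))  # toy reading times
rts$log_simple <- log(rts$simple_sentence)
round(rts$log_simple, 2)
# [1] 7.55 7.49
```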

We can inspect the first 6 elements of this new dataframe by using the head() function.

head(data_transformed)
## # A tibble: 6 x 8
##      id    wm    iq  comp simple_sentence complex_sentence log_simple
##   <dbl> <dbl> <dbl> <dbl>           <dbl>            <dbl>      <dbl>
## 1    95    47    94    19            2154             2441       7.68
## 2   400    45   118    18            1824             2456       7.51
## 3   457    42   100    22            1857             2324       7.53
## 4  1138    41    77    18            1902             2341       7.55
## 5  1587    54    67    21            1844             2320       7.52
## 6  1805    52   109    19            2224             2256       7.71
## # … with 1 more variable: log_complex <dbl>

We can then re-shape the dataRT dataframe from wide to long format using the gather() function - we map the result onto a new variable we’re calling data_long.

data_long <- gather(dataRT, "condition", "rt", c("simple_sentence", "complex_sentence"))

Let’s look at the first 6 rows…

head(data_long)
## # A tibble: 6 x 3
##      id condition          rt
##   <dbl> <chr>           <dbl>
## 1  1138 simple_sentence  1902
## 2  6223 simple_sentence  1797
## 3  6092 simple_sentence  2080
## 4  6232 simple_sentence  1856
## 5  8606 simple_sentence  1997
## 6  6400 simple_sentence  1868
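To see exactly what gather() did, here’s the same wide-to-long reshape written out by hand in base R on a two-row toy dataframe (gather() automates this stacking for you):

```r
# Wide format: one row per participant, one column per condition
wide <- data.frame(id = c(1, 2),
                   simple_sentence  = c(1902, 1797),
                   complex_sentence = c(2341, 2503))

# Long format: one row per OBSERVATION, with a column naming the condition
long <- data.frame(
  id        = rep(wide$id, 2),
  condition = rep(c("simple_sentence", "complex_sentence"), each = 2),
  rt        = c(wide$simple_sentence, wide$complex_sentence)
)
long
#   id        condition   rt
# 1  1  simple_sentence 1902
# 2  2  simple_sentence 1797
# 3  1 complex_sentence 2341
# 4  2 complex_sentence 2503
```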

OK, so now we’ve had a go at joining two datasets together and at re-shaping a dataset from wide to long format.

Let’s move onto the next activity. We’re going to use the gapminder dataset.

library(gapminder)

First we’re going to build a scatterplot of life expectancy against GDP per capita using the gapminder dataset. If you want to explore the dataset before you use it, you could type help(gapminder) or str(gapminder) in the console.

str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) + 
  geom_point() +
  labs(title = "Scatterplot of Life Expectancy against GDP per capita",
       x = "GDP per capita", 
       y = "Life Expectancy (years)")

Now let’s separate out the data by Continent.

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, colour = continent)) + 
  geom_point() +
  labs(title = "Scatterplot of Life Expectancy against GDP per capita by Continent",
       x = "GDP per capita", 
       y = "Life Expectancy (years)",
       colour = "Continent")

We can also use facet_wrap() to split by Continent which might make things a little easier to see.

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, colour = continent)) + 
  geom_point() + 
  facet_wrap(~ continent) +
  guides(colour = FALSE) +
  labs(title = "Scatterplot of Life Expectancy against GDP per capita by Continent",
       x = "GDP per capita", 
       y = "Life Expectancy (years)",
       colour = "Continent")  

Now we’re going to focus on just one country, Sweden - famous for pop pioneers, Abba, and everyone’s favourite progressive death metal band, Opeth. No? Just mine then…

We will use the filter() function from the dplyr package to include only data from “Sweden” in our new variable I’m calling gapminder_sub1.

gapminder_sub1 <- filter(gapminder, country == "Sweden")
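filter() keeps only the rows where the condition is TRUE. Base R’s subset() behaves similarly - here’s a minimal sketch with made-up values:

```r
# filter(gapminder, country == "Sweden") keeps rows matching the condition;
# base R's subset() is the closest equivalent (toy values below)
toy <- data.frame(country = c("Sweden", "Norway", "Sweden"),
                  lifeExp = c(71.9, 73.4, 72.5))
sweden_rows <- subset(toy, country == "Sweden")
sweden_rows
#   country lifeExp
# 1  Sweden    71.9
# 3  Sweden    72.5
```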

Let’s plot a scatterplot of life expectancy against GDP per capita and add a regression line.

ggplot(gapminder_sub1, aes(x = gdpPercap, y = lifeExp)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(title = "Sweden Life Expectancy by GDP",
       x = "GDP per capita", 
       y = "Life Expectancy (years)")

We can formally calculate the linear model using the lm() function.

model <- lm(lifeExp ~ gdpPercap, data = gapminder_sub1)
summary(model)
## 
## Call:
## lm(formula = lifeExp ~ gdpPercap, data = gapminder_sub1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6476 -0.3361  0.0124  0.1841  1.1721 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.853e+01  4.410e-01  155.38  < 2e-16 ***
## gdpPercap   3.834e-04  2.074e-05   18.49 4.62e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5311 on 10 degrees of freedom
## Multiple R-squared:  0.9716, Adjusted R-squared:  0.9687 
## F-statistic: 341.9 on 1 and 10 DF,  p-value: 4.618e-09
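Rather than reading numbers off summary(), you can also extract quantities from a fitted model directly with functions such as coef(), confint(), and predict(). A minimal sketch on made-up, noise-free data (known intercept 1, slope 2):

```r
# Fit a model to perfectly linear toy data so the estimates are exact
toy <- data.frame(x = 1:10)
toy$y <- 1 + 2 * toy$x                   # known intercept 1, slope 2
toy_model <- lm(y ~ x, data = toy)

coef(toy_model)                          # named vector: intercept = 1, slope = 2
predict(toy_model, newdata = data.frame(x = 20))  # 1 + 2 * 20 = 41
```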

We can look at some diagnostic plots to make sure everything looks ok.

plot(model)

Generally things look fine, although point 12 falls just outside a Cook’s distance of .5. Now we’re going to do the same for the UK.

gapminder_sub2 <- filter(gapminder, country == "United Kingdom")
ggplot(gapminder_sub2, aes(x = gdpPercap, y = lifeExp)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(title = "UK Life Expectancy by GDP",
       x = "GDP per capita", 
       y = "Life Expectancy (years)")

model <- lm(lifeExp ~ gdpPercap, data = gapminder_sub2)
summary(model)
## 
## Call:
## lm(formula = lifeExp ~ gdpPercap, data = gapminder_sub2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.75546 -0.29292 -0.03036  0.18865  0.99229 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.515e+01  4.231e-01  153.96  < 2e-16 ***
## gdpPercap   4.527e-04  2.051e-05   22.07 8.17e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5026 on 10 degrees of freedom
## Multiple R-squared:  0.9799, Adjusted R-squared:  0.9779 
## F-statistic: 487.2 on 1 and 10 DF,  p-value: 8.166e-10
plot(model)

In this case, point 12 is looking a little extreme.

Now, we’re going to combine our two dataframes using the rbind() function. This function will take two separate dataframes and bind them together by rows (i.e., one on top of the other) to produce a new dataframe. You can also combine two dataframes side by side by binding them together by columns using the cbind() function. Here we’re sticking with rbind() and we’re mapping the output of this onto a new variable I’m calling data_combined.

data_combined <- rbind(gapminder_sub1, gapminder_sub2)
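Here’s rbind() on two single-row toy dataframes (the values are made up) - note that the column names must match for rbind() to work, whereas cbind() instead requires the same number of rows:

```r
# rbind() stacks dataframes with the SAME column names on top of one another
top    <- data.frame(country = "Sweden",         lifeExp = 71.9)
bottom <- data.frame(country = "United Kingdom", lifeExp = 70.4)
stacked <- rbind(top, bottom)
stacked
#          country lifeExp
# 1         Sweden    71.9
# 2 United Kingdom    70.4
```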

We’re now going to produce a different kind of plot. We’re going to build a violin plot showing the distribution of life expectancy data for our two countries. We’re also adding some descriptive statistics using the stat_summary() function - we’re adding the mean and SE bars.

ggplot(data_combined, aes(x = country, y = lifeExp, fill = country)) + 
  geom_violin() + 
  geom_jitter(width = .1, alpha = .3) + 
  stat_summary(fun.data = mean_se, geom = "errorbar", width=.25) + 
  stat_summary(fun.y = mean, geom = "point", size = 4) + 
  guides(fill = FALSE)  +
  labs(title = "Life Expectancy for Sweden and the United Kingdom",
       x = "Country", 
       y = "Life Expectancy (years)")

Now let’s use our filter() function again to look at life expectancy against GDP per capita for the countries in Europe.

Instead of creating temporary variables, we can use the pipe operator %>% (read as “and then”) which allows us to filter the dataset before we pass it to ggplot(). In this case, we’re keeping only the rows of our dataset where the continent is Europe.

gapminder %>% 
  filter(continent == "Europe") %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) + 
  geom_point() +
  labs(title = "Life Expectancy against GDP per capita \nfor countries in Europe",
       x = "GDP per capita", 
       y = "Life Expectancy (years)")

Now we’re going to plot boxplots of the Life Expectancy data for all the countries in Europe. Notice how I’m using the fct_reorder() function to reorder the levels of our factor based on the median of each country’s Life Expectancy data.

gapminder %>% 
  filter(continent == "Europe") %>%
ggplot(aes(x = fct_reorder(country, lifeExp, median), y = lifeExp, fill = country)) + 
  geom_boxplot() + 
  coord_flip() + 
  guides(fill = FALSE) + 
  labs(title = "Boxplots of Life Expectancy in Countries in Europe over Time",
       x = "Country",
       y = "Life Expectancy (years)")
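fct_reorder() comes from the forcats package (loaded with the tidyverse). Base R’s reorder() does much the same job - here’s a minimal sketch with two made-up countries, reordering the levels by the median of another variable:

```r
# Reorder factor levels by the median of another variable, lowest first
country <- factor(c("A", "A", "B", "B"))
lifeExp <- c(80, 82, 70, 72)             # A's median is 81, B's is 71
reordered <- reorder(country, lifeExp, FUN = median)
levels(reordered)                        # B now comes before A
# [1] "B" "A"
```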

We’re now going to look at a dataset that comes with the tidyverse (specifically, with ggplot2) - the diamonds dataset, which contains the prices and other attributes of almost 54,000 diamonds. Let’s check the structure first using str()

str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Let’s first plot Price against Clarity while also capturing Carat. Remember the alpha parameter in the geom_jitter() layer sets the level of translucency of each point - it can vary from 0 to 1. Change it from .1 to 1 to see what happens…

ggplot(diamonds, aes(x = clarity, y = price, colour = carat)) + 
  geom_jitter(alpha = .1) + 
  labs(x = "Clarity (Worst to Best)", 
       y = "Price in USD", 
       colour = "Carat") 

Now let’s plot a histogram of diamond prices.

ggplot(diamonds, aes(x = price)) + 
  geom_histogram(bins = 50) + 
  labs(title = "Histogram of Diamond Prices", 
       x = "Price in USD")

We’re now going to look at histograms of diamond prices separately for each type of diamond ‘cut’. Notice we’re also plotting, in grey, the histogram of the overall dataset. For the first geom_histogram() call we’re using a version of our dataset with the cut column dropped via the select() function - the minus sign means we’re dropping the variable cut from the dataset diamonds. Because this background layer has no cut column, it isn’t split up by the facets and so appears in full in every panel. The next geom_histogram() call uses the unfiltered dataset inherited from earlier in the pipe - we then facet_wrap() by the variable labelled cut.

diamonds %>%
  ggplot(aes(x = price)) + 
  geom_histogram(data = select(diamonds, -cut), bins = 50, fill = "grey") + 
  geom_histogram(bins = 50) + 
  facet_wrap(~ cut) + 
  labs(title = "Histograms of Diamond Prices Faceted by Diamond Cut",
       x = "Price in USD")

Now let’s look at the whole data set again with Price plotted against type of diamond ‘cut’. We’re also adding some descriptive statistics using the stat_summary() function.

diamonds %>%
  ggplot(aes(x = cut, y = price, colour = carat)) + 
  geom_jitter(alpha = .1) +
  stat_summary(fun.data = mean_cl_boot, colour="red") +
  labs(x = "Diamond Cut", y = "Price in USD", colour="Carat") 

Let’s work out some descriptives ourselves using group_by() and summarise() from the dplyr package, alongside a bit of piping %>%. We’re working out (grouped by cut) the mean and standard deviations for price and carat.

diamonds %>% 
  group_by(cut) %>% 
  summarise(mean_price = mean(price), sd_price = sd(price), count = n(), mean_carat = mean(carat),
            sd_carat = sd(carat))
## # A tibble: 5 x 6
##   cut       mean_price sd_price count mean_carat sd_carat
##   <ord>          <dbl>    <dbl> <int>      <dbl>    <dbl>
## 1 Fair           4359.    3560.  1610      1.05     0.516
## 2 Good           3929.    3682.  4906      0.849    0.454
## 3 Very Good      3982.    3936. 12082      0.806    0.459
## 4 Premium        4584.    4349. 13791      0.892    0.515
## 5 Ideal          3458.    3808. 21551      0.703    0.433
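Base R can compute the same kind of grouped summary with aggregate(), one summary function at a time - part of what makes summarise() convenient is that it handles several at once. A toy sketch with made-up prices:

```r
# group_by() + summarise() computes one row per group; aggregate() is
# the base R analogue for a single summary function
toy <- data.frame(cut   = c("Fair", "Fair", "Good", "Good"),
                  price = c(4000, 4700, 3900, 3960))
means <- aggregate(price ~ cut, data = toy, FUN = mean)
means
#    cut price
# 1 Fair  4350
# 2 Good  3930
```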

Let’s choose a mid-ranking clarity level and smaller diamonds (< .5 carats). We use the filter() function to do that. Can a diamond’s price be predicted by its carat level?

diamonds %>% 
  filter(clarity == "VS2" & carat < .5) %>%
  ggplot(aes(x = carat, y = price)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(title = "Price in USD against Carats for \nMid-Range Clarity Small Diamonds",
       x = "Carats", 
       y = "Price in USD")

The graph certainly suggests we can predict price if we know a diamond’s carat level. Is this supported by a linear model?

model <- lm(price ~ carat, data = filter(diamonds, clarity == "VS2" & carat < .5))
summary(model)
## 
## Call:
## lm(formula = price ~ carat, data = filter(diamonds, clarity == 
##     "VS2" & carat < 0.5))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -377.84 -104.68   -3.57  121.74 1016.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -155.47      16.07  -9.675   <2e-16 ***
## carat        2668.40      46.51  57.372   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 143.2 on 4084 degrees of freedom
## Multiple R-squared:  0.4463, Adjusted R-squared:  0.4461 
## F-statistic:  3292 on 1 and 4084 DF,  p-value: < 2.2e-16
plot(model)

Looks like we have a pretty nice fitting model with only one or two outliers.
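If you want to identify influential points numerically rather than eyeballing the diagnostic plots, cooks.distance() returns each point’s Cook’s distance directly. A sketch on made-up data where the last point is deliberately extreme:

```r
# Nine points on the line y = 2x, plus one wildly off-trend point
toy <- data.frame(x = 1:10, y = c(2 * (1:9), 60))
toy_model <- lm(y ~ x, data = toy)

# Which points exceed the conventional Cook's distance cut-off of 0.5?
unname(which(cooks.distance(toy_model) > 0.5))
# [1] 10
```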